Method for translating programs for reconfigurable architectures

ABSTRACT

A method for translating high-level languages to reconfigurable architectures is disclosed. The method includes building a finite automaton for calculation. The method further includes forming a combinational network of a plurality of individual functions in accordance with the structure of the finite automaton. The method further includes allocating a plurality of memories to the network for storing a plurality of operands and a plurality of results.

1. INTRODUCTION

[0001] Paralleling compilers according to the prior art normally use special constructs such as semaphores and/or other methods for synchronization. Typically, technology-related methods are used. Known methods are not suitable for combining functionally specified architectures with the associated time response and an imperatively specified algorithm. The methods used will, therefore, only supply satisfactory solutions in special cases.

[0002] Compilers for reconfigurable architectures normally use macros which have been specially generated for the particular reconfigurable hardware, hardware description languages (such as e.g. Verilog, VHDL, System-C) being used in most cases for generating the macros. These macros are then called up (instanced) out of the program flow by a normal high-level language (e.g. C, C++).

[0003] The present patent describes a method for automatically mapping functionally or imperatively formulated computing rules onto different target technologies, particularly onto ASICs, reconfigurable chips (FPGAs, DPGAs, VPUs, ChessArray, KressArray, Chameleon, etc.; combined under the term VPU in the text which follows), sequential processors (CISC/RISC CPUs, DSPs, etc.; combined under the term CPU in the text which follows) and parallel processor systems (SMP, MMP, etc.). In this connection, particular reference is made to the following patents and patent applications by the same applicant: P 44 16 881.0-53, DE 197 81 412.3, DE 197 81 483.2, DE 196 54 846.2-53, DE 196 54 593.5-53, DE 197 04 044.6-53, DE 198 80 129.7, DE 198 61 088.2-53, DE 199 80 312.9, PCT/DE 00/01869, DE 100 36 627.9-33, DE 100 28 397.7, DE 101 10 530.4, DE 101 11 014.6, PCT/EP 00/10516, EP 01 102 674.7, PACT13, PACT17, PACT18, PACT22, PACT24, PACT25, PACT26US, PACT02, PACT04, PACT08, PACT10. These are herewith incorporated to their full extent for purposes of disclosure.

[0004] VPUs basically consist of a multidimensional homogeneous or inhomogeneous flat or hierarchical arrangement (PA) of cells (PAEs) which can perform arbitrary functions, particularly logical and/or arithmetic functions and/or storage functions and/or network functions. The PAEs are associated with a loading unit (CT) which determines the operation of the PAEs by configuration and possibly reconfiguration. The method is based on an abstract parallel machine model which, apart from the finite automaton, also integrates imperative problem specifications and provides for an efficient algorithmic derivation of an implementation to different technologies.

2. DESCRIPTION

[0005] The basis for working through virtually any method for specifying algorithms is the finite automaton according to the prior art. FIG. 1 shows the structure of a finite automaton. A simple finite automaton is divided into a combinational network and a register stage for temporarily storing data between the individual data processing cycles.
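
By way of illustration, this structure can be modeled by the following minimal Python sketch (the function names are illustrative assumptions of this text): a combinational function is applied to the input and to the register contents, and the register is fed back into the network on every cycle.

    # Minimal sketch of FIG. 1: a combinational network plus a register stage.
    def run_automaton(step, inputs, state=0):
        # step: combinational network mapping (input, state) -> (output, next_state);
        # 'state' plays the role of the register, fed back each cycle.
        outputs = []
        for x in inputs:                   # one iteration = one data processing cycle
            out, state = step(x, state)    # feedback of the register into the network
            outputs.append(out)
        return outputs

    # Example: an accumulator; the register output feeds the adder input.
    print(run_automaton(lambda x, s: (x + s, x + s), [1, 2, 3]))   # [1, 3, 6]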

[0006] The finite automaton makes it possible to map complex algorithms onto any sequential machine, as shown in FIG. 2. The complex finite automaton shown consists of a complex combinational network, a memory for storing data and an address generator for addressing the data in the memory.

[0007] In principle, any sequential program can be interpreted as a finite automaton, and in most cases a very large combinational network is produced. For this reason, the combinational operations in the programming of traditional “von Neumann” architectures—i.e. in all CPUs—are split into a sequence of in each case individual, simple, predetermined operations (OpCodes) on registers in the CPU. This splitting results in states for controlling the combinational operation that has been split into a sequence; these states do not exist within the original combinational operation or are not needed there. The states to be processed by a von Neumann machine can therefore be distinguished in principle from the algorithmic states of a combinational network, i.e. the registers of finite automatons.

[0008] In contrast to the rigid OpCodes of CPUs, the VPU technology (essentially defined by the documents PACT01, PACT02, PACT03, PACT04, PACT05, PACT08, PACT10, PACT13, PACT17, PACT18, PACT22, PACT24, which are incorporated to their full extent by reference) provides for the flexible configuration of complex combinational operations (complex instructions) in accordance with the algorithm to be mapped.

[0009] 2.1 Operation of the Compiler

[0010] It is furthermore an essential operation of the compiler to generate the complex instructions in such a manner that they can be executed for as long as possible in the PAE matrix without reconfiguration.

[0011] The compiler also generates the finite automaton from the imperative source text in such a manner that it can be executed optimally in the PAE matrix.

[0012] The finite automaton is split into configurations.

[0013] The processing (interpreting) of the finite automaton is done in a VPU in such a manner that the configurations generated are progressively mapped to the PAE matrix and the operating data and/or states, which must be transmitted between the configurations, are stored in the memory. For this purpose, the method known from PACT04 or, respectively, the corresponding architecture can be used.

[0014] In other words, a configuration represents a plurality of instructions; a configuration determines the operation of the PAE matrix for a multiplicity of clock cycles during which a multiplicity of data is processed in the matrix; these originate from a source external to the VPU and/or an internal memory and are written to an external source and/or to an internal memory. The internal memories replace the set of registers of a CPU of the prior art in such a manner that e.g. a register is represented by a memory and, according to the operating principle of the VPU technology, it is not a data word which is stored per register but an entire data record per memory.

[0015] It is essential that the data and/or states of the processing of a configuration being executed are stored in the memories and are thus available for the next configuration executed.

[0016] A significant difference from compilers paralleling on an instruction basis consists in that the method emulates a combinational network on a PAE matrix, whereas conventional compilers combine sequences of instructions (OpCodes).

[0017] 2.2 Exemplary WHILE Language Embodiment

[0018] In the text which follows, the operation of the compiler will be illustrated by way of example by means of a simple language. The principles of this language are already known from [reference “Armin Nückel thesis”]. However, this only describes the mapping of a function to a static combinational network. An essential innovation of the invention is the mapping to configurations which are then mapped to the PAE matrix in a temporal sequence in accordance with the algorithm and the states resulting during the processing.

[0019] The “WHILE” programming language is defined as follows:

[0020] Syntax: WHILE . . .

[0021] Constructs:

[0022] Instruction

[0023] Sequence of instructions

[0024] Loop
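
For illustration, these three constructs can be represented by a small abstract syntax tree; the following Python sketch is a hypothetical representation (the names Assign, Seq and While are assumptions of this sketch, not part of the language definition):

    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class Assign:              # instruction, e.g. x1 := x1 + 1
        target: str
        expr: str

    @dataclass
    class Seq:                 # sequence of instructions
        body: List["Stmt"]

    @dataclass
    class While:               # loop
        cond: str
        body: "Stmt"

    Stmt = Union[Assign, Seq, While]

    # WHILE x1 < 10 DO x1 := x1 + 1
    prog = While("x1 < 10", Seq([Assign("x1", "x1 + 1")]))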

[0025] 2.2.1 Instructions

[0026] An instruction or a sequence of instructions can be mapped to a combinational network by the compiler method described.

[0027] FIG. 3a shows a combinational network with the associated variables. The content of one and the same variable (e.g. x1) can change from one stage (0301) of the network to the next (0302).

[0028] This change is shown by way of example for the assignment x1:=x1+1 in FIG. 3b.

[0029] 2.2.2 Addressing of Variables

[0030] For the purpose of addressing for reading the operands and for storing the results, address generators can be synchronized with the combinational network of the assignment. With each variable processed, corresponding new addresses are generated for operands and results (FIG. 3c). In principle, the address generator can be of any type and depends on the addressing schemes of the compiled application. For operands and results, common, combined or completely independent address generators can be implemented.
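
For illustration, the following Python sketch assumes the simplest case, linear (FIFO-like) addressing: one generator steps the operand addresses and one steps the result addresses, in lockstep with the combinational network of the assignment (all names are illustrative):

    # Hypothetical linear address generator: one new address per processed data word.
    def linear_addresses(base=0, stride=1):
        addr = base
        while True:
            yield addr
            addr += stride

    op_addr  = linear_addresses()          # addresses into the operand memory (0202)
    res_addr = linear_addresses()          # addresses into the result memory (0203)

    memory_in  = [10, 11, 12]
    memory_out = [None] * len(memory_in)
    for _ in memory_in:
        a, r = next(op_addr), next(res_addr)
        memory_out[r] = memory_in[a] + 1   # the assignment x1 := x1 + 1
    print(memory_out)                      # [11, 12, 13]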

[0031] Since typically a plurality of data are processed within a certain configuration of the PAEs in the present data processing model, simple FIFO modes are available for most applications, at least for the data memories, which are used within this description for storing data and states of the data processing (virtually as a replacement for the conventional set of registers of conventional CPUs) (compare PACT04).

[0032] 2.2.3 Sequences of Instructions

[0033] A sequence of the exemplary assignment can be generated as follows (FIG. 4a):

[0034] x1:=0;

[0035] WHILE TRUE DO

[0036] x1:=x1+1;

[0037] This sequence can now be mapped by means of an assignment according to 2.2.1 and address generators for operands and results.

[0038] Finite Sequences

[0039] For the sake of completeness, a particular embodiment of sequences from the defined constructs of the WHILE language will be discussed. A finite sequence of the exemplary assignment can be generated as follows:

[0040] FOR i:=1 TO 10

[0041] x1:=x1+1;

[0042] Such a sequence can be implemented in two ways:

[0043] a) By generating an adder for calculating i in accordance with the WHILE construct (see 2.2.4) and a further adder for calculating x1. The sequence is mapped as a loop and calculated iteratively (FIG. 5a).

[0044] b) By rolling out the loop, which dispenses with the calculation of i as a function. The calculation of x1 is instanced i times and built up as a pipeline, which produces i concatenated adders (FIG. 5b).
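
The two variants can be illustrated behaviorally by the following Python sketch (a simplification; it models only the data values, not the PAEs themselves):

    # a) iterative: one adder is reused; i exists as an explicit counter state
    def iterative(x1):
        for i in range(1, 11):             # FOR i := 1 TO 10
            x1 = x1 + 1
        return x1

    # b) rolled out: the calculation of i is dispensed with; ten concatenated
    #    adder stages form a pipeline which each data word passes once
    stages = [lambda v: v + 1 for _ in range(10)]
    def rolled_out(x1):
        for stage in stages:
            x1 = stage(x1)
        return x1

    assert iterative(0) == rolled_out(0) == 10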

[0045] 2.2.4 Conditions

[0046] Conditions can be expressed by means of WHILE. For example:

[0047] x1:=0;

[0048] WHILE x1<10 DO

[0049] x1:=x1+1;

[0050] The mapping generates an additional PAE for processing the comparison. The result of the comparison is represented by a status signal (compare PACT08) which is evaluated by the PAEs processing the instruction and the address generators.

[0051] The resultant mapping is shown in FIG. 4b. Evaluation of the condition (here WHILE, generally also IF, CASE) generates a status which can be provided to the subsequent data processing (PACT08) and/or sent to the CT or a local load control (PACT04), which derives from this information the further program flow and any reconfigurations which may be required (PACT04, PACT05, PACT10, PACT13, PACT17).
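
The following Python sketch is a purely behavioral model of this mapping (the reference numerals in the comments refer to the figure description in section 4; in hardware, the control flow is carried by the status signal, not by a program counter):

    def while_loop_configuration(start):
        x1 = start                    # multiplexer 0402 selects the start value once
        while True:
            status = x1 < 10          # comparison 0412 generates status signal 0413
            if not status:            # termination detected; status may go to the CT
                return x1             # 0411 forwards the valid result
            x1 = x1 + 1               # loop body 0401, fed back via register 0405

    print(while_loop_configuration(0))     # 10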

[0052] 2.2.5 Basic Method

[0053] According to the previous method, each program can be mapped in a system which is built up as follows:

[0054] 1. Memory for operands

[0055] 2. Memory for results

[0056] 3. Address generator(s)

[0057] 4. Network of a) assignments and/or b) While instructions.

[0058] 2.2.6 Handling States

[0059] A distinction is made between algorithmically relevant and irrelevant states.

[0060] Relevant states are necessary within the algorithm for describing its correct operation. They are essential to the algorithm.

[0061] Irrelevant states are produced by the hardware used and/or by the selected mapping or from other secondary reasons. They are essential for the mapping (i.e. the hardware).

[0062] It is only the relevant states which must be obtained with the data. For this reason, they are stored together with the data in the memories, since they occurred either as a result of the processing with the data or are necessary as operands with the data for the next processing cycle.

[0063] In contrast, irrelevant states are necessary only locally and for a limited time and do not, therefore, need to be stored.

EXAMPLE

[0064] a) The state information of a comparison is relevant for the further processing of the data since it determines the functions to be executed.

[0065] b) Assume a sequential divider is produced, for example, by mapping a division instruction onto hardware which only supports sequential division. This results in a state which identifies the mathematical step within the division. This state is irrelevant since only the result (i.e. the division carried out) is required for the algorithm. In this case, only the result and the time information (i.e. the availability) are thus needed.

[0066] The time information can be obtained by the RDY/ACK handshake, for example in the VPU technology according to PACT01, 02, 13. However, it must be especially noted in this regard that the handshake does not represent a relevant state either, since it only signals the validity of the data, as a result of which the remaining relevant information, in turn, is reduced to the existence of valid data.

[0067] 2.2.7 Handling Time

[0068] In many programming languages, particularly in sequential ones such as e.g. C, a precise temporal order is implicitly predetermined by the language; in sequential programming languages, for example, it is given by the order of the individual instructions.

[0069] If required by the programming language and/or the algorithm, the time information can be mapped to synchronization models such as RDY/ACK and/or REQ/ACK or a time stamp method according to PACT18.

[0070] 2.3 Macros

[0071] More complex functions of a high-level language, such as e.g. loops, are implemented by macros. The macros are predetermined by the compiler and instanced at translation time (compare FIG. 4).

[0072] The macros are built up either of simple language constructs of the high-level language or at assembler level. Macros can be parameterized in order to provide for a simple adaptation to the algorithm described (compare FIG. 5, 0502).

[0073] 2.4 Feedback Loops and Registers

[0074] Undelayed feedbacks which oscillate in an uncontrolled manner can arise within the mapping of an algorithm into a combinational network.

[0075] In VPU technologies according to PACT02, this is prevented by the structure of the exemplary PAE in that at least one register is permanently defined in the PAEs for the purpose of decoupling.

[0076] In general, undelayed feedbacks can be detected by analyzing the graph of the combinational network produced. Registers for decoupling are then inserted collectively into the data paths in which an undelayed feedback exists.
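
For illustration, such an analysis can be sketched as a cycle search over the graph; the following Python sketch (representation and names are assumptions of this text) marks one edge of every undelayed feedback, i.e. a data path into which a decoupling register would be inserted:

    # The combinational network as a dict: node -> list of successor nodes.
    def edges_needing_registers(graph):
        marked, state = set(), {}          # state: 1 = on DFS stack, 2 = finished
        def dfs(u):
            state[u] = 1
            for v in graph.get(u, ()):
                if state.get(v) == 1:      # back edge found: undelayed feedback
                    marked.add((u, v))
                elif v not in state:
                    dfs(v)
            state[u] = 2
        for n in graph:
            if n not in state:
                dfs(n)
        return marked

    # x1 := x1 + 1: the adder output feeds back to its own input (compare FIG. 4a)
    print(edges_needing_registers({"add": ["add", "out"], "out": []}))
    # {('add', 'add')} -> insert a decoupling register (0405) on this edge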

[0077] When registers are inserted, the correct operation of the calculation is also ensured by using handshake protocols (e.g. RDY/ACK according to 2.2.7).

[0078] 2.5 Time Domain Multiplexing (TDM)

[0079] In principle, any PAE matrix implemented in practice is only of finite size. For this reason, a partitioning of the algorithm according to 2.2.5 para 4 a/b into a plurality of configurations which are successively configured into the PAE matrix must be performed in the subsequent step. The aim is to calculate as many data packets as possible in the network without having to reconfigure.

[0080] Between the configurations, a buffer memory is introduced which—similar to a register in the case of CPUs—stores the data between the individual configurations executed sequentially.

[0081] In other words, in VPU technology it is not an OpCode which is sequentially executed but complex configurations. Whereas, in the case of CPUs, an OpCode typically processes a data word, a plurality of data words (a data packet) is processed by a configuration in the VPU technology. As a result, the efficiency of the reconfigurable architecture increases due to a better relationship between reconfiguration effort and data processing.

[0082] In the VPU technology, a memory is used instead of a register since it is not data words but data packets which are processed between the configurations. This memory can be constructed as random access memory, stack, FIFO or any other memory architecture, a FIFO typically being the most effective and the most easily implemented.
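
The principle can be illustrated by the following Python sketch (behavioral only): a whole data packet is processed by one configuration, buffered in a FIFO, and then processed further by the next configuration after the notional reconfiguration:

    from collections import deque

    def run_configuration(func, source):
        # one configuration processes a whole data packet, word by word
        return deque(func(x) for x in source)      # FIFO between configurations

    packet = range(8)                              # a data packet, not a single word
    fifo   = run_configuration(lambda x: x * x, packet)   # configuration 1
    result = run_configuration(lambda x: x + 1, fifo)     # configuration 2, after reconfiguration
    print(list(result))                            # [1, 2, 5, 10, 17, 26, 37, 50]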

[0083] Data are then processed by the PAE matrix in accordance with the algorithm configured and stored in one or more memories. The PAE matrix is reconfigured after the processing of a set of data, and the new configuration takes the intermediate results from the memory(ies) and continues the execution of the program. In the process, new data can also easily flow additionally into the calculation from external memories and/or from the peripherals, and results can likewise be written to external memories and/or to the peripherals.

[0084] In other words, the typical sequence of data processing is the reading out of internal RAMs, the processing of the data in the matrix and the writing of the data into the internal memories; arbitrary external sources can easily be used for data processing, or destinations for data transfers, in addition to or instead of internal memories.

[0085] Whereas “sequencing” in CPUs is defined as the reloading of an OpCode, “sequencing” of VPUs is defined as the (re)configuring of configurations.

[0086] The information as to when and/or how sequencing takes place (i.e. which configuration is to be configured next) can be represented by various information items which can be used individually or in combination. E.g. the following strategies are appropriate for deriving the information:

[0087] a) defined by the compiler at translation time

[0088] b) defined by the event network (Trigger, PACT08)

[0089] c) defined by the fill ratio of the memories (Trigger, PACT08, PACT04).

[0090] 2.5.1 Influence of the TDM on the Processor Model

[0091] The partitioning of the algorithm decisively determines the relevant states which are stored in the memories between the various configurations. If a state is only relevant within a configuration (locally relevant state), it is not necessary to store it.

[0092] Nevertheless, it is useful to store these states for the purpose of debugging the program to be executed, in order to provide the debugger with access to these states. This necessity is described in greater detail in the debugging method (PACT21) of the same date. Furthermore, states can additionally become relevant if a task switch mechanism is used (e.g. by an operating system or interrupt sources) and currently executed configurations are interrupted, other configurations are loaded and the aborted configuration is to be continued at a later time. A more detailed description follows in the next section.

[0093] A simple example is to be used for demonstrating the discriminating feature for locally relevant states:

[0094] a) A branch of the type “if ( ) then . . . else . . . ” fits completely into a single configuration, i.e. both data paths (branches) are mapped completely within the configuration. The state resulting from a comparison is relevant but local, since it is no longer needed in the subsequent configurations.

[0095] b) The same branching is too large to fit completely into a single configuration. A number of configurations are necessary for mapping the complete data paths. In this case, the state is globally relevant and must be stored and allocated to the respective data, since the subsequent configurations need the respective state of the comparison during the further processing of the data.

[0096] 2.6 Task Switching

[0097] The possible use of an operating system has an additional influence on the observation and handling of states. Operating systems use, for example, task schedulers for administering a number of tasks in order to provide multitasking.

[0098] Task schedulers terminate tasks at a particular time, start other tasks and return to the further processing of the aborted task after the other ones have been processed. If it is ensured that a configuration—which corresponds to the processing of a task—terminates only after the complete processing—i.e. when all data and states to be processed within this configuration cycle are stored—locally relevant states can remain unstored.

[0099] If, however, the task scheduler terminates configurations before they have been completely processed, local states and/or data must be stored. Furthermore, this is of advantage if the processing time of a configuration cannot be predicted. This also appears useful in conjunction with the known halting problem and the risk that a configuration will not terminate (e.g. due to a fault), in order to prevent by this means a deadlock of the entire system.

[0100] In other words, taking into consideration task switching, relevant states must also be considered to be those which are necessary for task switching and for a new correct start of the data processing.

[0101] In the case of a task switch, the memory for results and possibly also the memory for the operands must be saved and established again at a later time, that is to say on return to this task. This can be done similarly to the PUSH/POP instructions and methods of the prior art. Furthermore, the state of the data processing must be saved, i.e. the pointer to the last operands completely processed. Special reference is made here to PACT18.

[0102] Depending on the optimization of the task switch, there are two possibilities, for example:

[0103] a) The terminated configuration is reconfigured and only the operands are loaded. Data processing begins once again as if the processing of the configuration had not yet begun at all. In other words, all data calculations are simply executed from the beginning, even where some calculations have already been performed previously. This possibility is simple but not very efficient.

[0104] b) The terminated configuration is reconfigured and the operands and results already calculated are loaded into the respective memories. The data processing is continued with the operands which have not yet been completely calculated. This method is very much more efficient but presupposes that additional states which occur during the processing of the configuration may become relevant; for example, at least one pointer to the last operands completely processed must be saved so that it is possible to start again with their successors after completed reconfiguration.
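
Variant b) can be illustrated by the following Python sketch (the scheduler hook 'interrupted' is hypothetical and stands for the task scheduler terminating the configuration):

    def process(operands, results, resume_at=0):
        for i in range(resume_at, len(operands)):
            if interrupted():                      # configuration is terminated early
                return {"results": results, "pointer": i}      # checkpoint (PUSH)
            results.append(operands[i] + 1)        # exemplary data processing
        return {"results": results, "pointer": len(operands)}

    _events = iter([False, False, False, True] + [False] * 100)
    def interrupted():                             # hypothetical scheduler hook
        return next(_events)

    ckpt = process(list(range(6)), [])             # aborted: pointer = 3
    done = process(list(range(6)), ckpt["results"], ckpt["pointer"])   # resume (POP)
    print(done["results"])                         # [1, 2, 3, 4, 5, 6]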

[0105] 2.7 Algorithmic Optimization

[0106] The translation method described separates control structures from algorithmic structures. For example, a loop is split into a body (WHILE) and an algorithmic structure (instructions).

[0107] The algorithmic structures can then be optionally optimized by an additional tool following the separation.

[0108] For example, a subsequent algebra software can optimize and minimize the programmed algorithms. Such tools are known, e.g. AXIOM, MARBLE, etc. Due to the minimization, a quicker execution of the algorithm and/or a considerably reduced space requirement can be achieved.

[0109] The result of the optimization is then conducted back into the compiler and processed further accordingly.

3. APPLICABILITY FOR PROCESSORS OF THE PRIOR ART, PARTICULARLY WITH VLIW ARCHITECTURE

[0110] It should be noted particularly that, instead of a PAE matrix, an arrangement of arithmetic logic units (ALUs) of the prior art, such as normally used, for example, in VLIW processors, and/or an arrangement of complete processors, such as normally used, for example, in multiprocessor systems, can also be used. The use of an individual ALU represents a special case, so that the method can also be used for normal CPUs.

[0111] In the dissertation [reference Armin Nückel dissertation], a method was developed which provides for the translation of the WHILE language into semantically correct finite automatons. Beyond that, a finite automaton can be used as a “subroutine” and vice versa. This provides the possibility of mapping a configuration to different implementation technologies such as, e.g., CPUs, symmetric multiprocessors, FPGAs, ASICs, VPUs.

[0112] In particular, it is possible to allocate to parts of an application the hardware which is optimally suited in each case. In other words, a data flow structure, for example, would be transferred to a data flow architecture, whereas a sequential structure is mapped to a sequencer.

[0113] The problems arising with resource allocations for the individual algorithms can be solved, e.g. by the job assignment algorithm for administering the allocation.

4. DESCRIPTION OF THE FIGURES

[0114] The figures which follow show exemplary implementations and embodiments of the compiler.

[0115] FIG. 1a shows the structure of a normal finite automaton in which a combinational network (0101) is combined with a register (0102). Data can be conducted directly to 0101 (0103) and 0102 (0104). By feeding back (0105) the register to the combinational network, a state can be processed in dependence on the previous states. The processing results are represented by 0106.

[0116] FIG. 1b shows a representation of the finite automaton by a reconfigurable architecture according to PACT01 and PACT04 (PACT04 FIGS. 12-15). The combinational network from FIG. 1a (0101) is replaced by an arrangement of PAEs 0107 (0101b). The register (0102) is implemented by a memory (0102b) which can store a number of cycles. The feedback according to 0105 is carried out by 0105b. The inputs (0103b and 0104b, respectively) are equivalent to 0103 and 0104, respectively. The direct access to 0102b may be implemented by a bus through the array 0101b. The output 0106b is again equivalent to 0106.

[0117] FIG. 2 shows the mapping of a finite automaton to a reconfigurable architecture. 0201(x) represent the combinational network (which can be constructed as PAEs according to FIG. 1b). There are one or more memories for operands (0202) and one or more memories for results (0203). Additional data inputs/outputs (according to 0103b, 0104b, 0106b) are not shown for the sake of simplicity. An address generator (0204, 0205) is in each case allocated to the memories.

[0118] The operand and result memories (0202, 0203) are physically or virtually coupled to one another in such a manner that, for example, the results of a function can be used as operands by another one, and/or results and operands of a function can be used as operands by another one. Such coupling can be established, for example, by bus systems or by a (re)configuration by means of which the function and networking of the memories with 0201 are reconfigured.

[0119] FIG. 3 shows various aspects for dealing with variables. In FIG. 3a, 0301, 0302, 0303 show various stages of the calculation. These stages can be purely combinational or also separated from one another via registers. f1, f2, f3 are functions; x1 is the variable according to the description of the patent.

[0120] FIG. 3b shows the variable x1 for the function x1:=x1+1.

[0121] FIG. 3c shows the behavior of a finite automaton calculating x1:=x1+1 within a configuration. In the next configuration, 0306 and 0304 must be exchanged in order to obtain a complete finite automaton. 0305 represents the address generators for the memories 0304 and 0306.

[0122] FIG. 4 shows implementations of loops. The shaded modules can be generated by macros (0420, 0421). 0421 can also be inserted by analyzing the graphs for undelayed feedbacks.

[0123] FIG. 4a shows the implementation of a simple loop of the type

[0124] WHILE TRUE DO

[0125] x1:=x1+1;

[0126] At the heart of the loop, the counter +1 (0401) is located. 0402 is a multiplexer which, at the beginning, conducts the starting value of x1 (0403) to 0401 and then the feedback (0404a, 0404b) with each iteration. A register (0405) is inserted into the feedback in order to prevent any undelayed, and thus uncontrolled, feedback of the output of 0401 to its input. 0405 is clocked with the operating clock of the VPU and thus determines the iterations per unit time. The respective count could be picked up at 0404a or 0404b. However, the loop does not terminate, and whether it can be used depends on the definition of the high-level language: for example, 0404 could be used in an HDL according to the prior art (e.g. VHDL, Verilog), whereas 0404 cannot be used in a sequential programming language (e.g. C), since the loop does not terminate and thus does not supply an exit value.

[0127] 0402 is produced as a macro from the loop construct. The macro is instanced by the translation of WHILE. 0405 is either also part of the macro or is inserted precisely when and where an undelayed feedback exists in accordance with an analysis of the graphs according to the prior art.

[0128] FIG. 4b shows the structure of a genuine loop of the type

[0129] WHILE

[0130] x1<10 DO

[0131] x1:=x1+1;

[0132] The structure corresponds basically to FIG. 4a, which is why the same references have been used.

[0133] In addition, there is a circuit which checks the validity of the result (0410) and only forwards 0404a to the subsequent functions (0411) when the termination criterion of the loop has been reached. The termination criterion is detected by the comparison x1<10 (0412). As a result of the comparison, the relevant status flag (0413) is conducted to 0402 for controlling the loop and to 0411 for controlling the forwarding of the result. 0413 can be implemented, for example, by triggers according to PACT08. Similarly, 0413 can be sent to a CT which thereupon detects the termination of the loop and performs a reconfiguration.

[0134] FIG. 5a shows the iterative calculation of

[0135] FOR

[0136] i:=1 TO 10

[0137] x1:=x1 * x1;

[0138] Essentially, the basic function corresponds to FIG. 4b, which is why the references have been adopted. 0501 calculates the multiplication. The FOR loop is implemented by a further loop according to FIG. 4b and is only indicated by 0503. 0503 supplies the status of the comparison for the termination criterion. The status is directly used for driving the iteration, as a result of which 0412 (represented by 0502) is largely unnecessary.

[0139] FIG. 5b shows the rolling out of the calculation of

[0140] FOR

[0141] i:=1 TO 10

[0142] x1:=x1 * x1;

[0143] Since the number of iterations is precisely known at translation time, the calculation can be mapped to a sequence of i multipliers (0510).

[0144] FIG. 6 shows the execution of a WHILE loop according to FIG. 4b over a number of configurations. The state of the loop (0601) is here a relevant state, since it significantly influences the function in the subsequent configurations. The calculation extends over 4 configurations (0602, 0603, 0604, 0605). Between the configurations, the data are stored in memories (0606, 0607). 0607 also replaces 0405.

[0145] As a reconfiguration criterion, the fill ratio of the memories (0606, 0607: memory full/empty) and/or 0601, which indicates the termination of the loop, is used. In other words, the fill ratio of the memories generates triggers (compare PACT01, PACT05, PACT08, PACT10) which are sent to the CT and trigger a reconfiguration. The state of the loop (0601) can also be sent to the CT. The CT can then configure the subsequent algorithms when the termination criterion is reached or possibly first process the remaining parts of the loop (0603, 0604, 0605) and then load the subsequent configurations.

5. LIMITS OF PARALLELABILITY

[0146] FIG. 6 shows the limits of parallelability.

[0147] a) If the calculation of the operands is independent of the feedback 0608, the loop can be calculated in blocks, i.e. in each case by filling the memories 0606/0607. This results in a high degree of parallelism.

[0148] b) If the calculation of an operand is dependent on the result of the previous calculation, that is to say 0608 is included in the calculation, the method becomes more inefficient, since in each case only one operand can be calculated within the loop.

[0149] If the usable ILP (Instruction Level Parallelism) within the loop is high and the time for reconfiguration is low (compare PACT02, PACT04, PACT13, PACT17), a calculation rolled out onto PAEs can still be efficient on a VPU.

[0150] If this is not the case, it is useful to map the loop to a sequential architecture (a processor separate from the PA, or an implementation within the PA according to PACT02, PACT04 and especially PACT13 (FIGS. 5, 11, 16, 17, 23, 30, 31, 33)).

[0151] The calculation times can be analyzed either at translation time in the compiler in accordance with the next section or can be measured empirically at run time and subsequently optimized.

6. ANALYSIS AND PARALLELING METHOD

[0152] For the analysis and performance of the paralleling, various methods of the prior art are available.

[0153] In the text which follows, a preferred method will be described.

[0154] Functions to be mapped, where an application can be composed of an arbitrary number of different functions, are represented by graphs (compare PACT13). The graphs are examined for the parallelism contained in them, and all methods for optimizing can be used ab initio.

[0155] For example, the following investigations are to be performed:

[0156] 6.0.1 ILP (Instruction Level Parallelism)

[0157] ILP expresses which instructions can be executed at the same time. Such an analysis is possible in a simple manner on the basis of dependences of nodes in a graph. Corresponding methods are sufficiently well known in accordance with the prior art and in mathematics. Reference is made, for example, to VLIW compilers and synthesis tools.

[0158] Special attention needs to be paid to e.g. possibly interleaved conditional executions (IF), since a correct statement about which paths can be executed in parallel can frequently be made only with difficulty or not at all, there being a great dependence on the value space of the individual parameters, which is frequently unknown or only inadequately known. A precise analysis can also consume such an amount of computing time that it can no longer be usefully performed.

[0159] In such cases, the analysis can be simplified, for example, by hints from the programmer, and/or it is possible to work with corresponding compiler switches in such a manner that, in case of doubt, the starting point is either a high parallelability (possibly at the cost of resources) or a low parallelability (possibly at the cost of performance). An empirical analysis can also be performed at run time in these cases. According to PACT10, PACT17, methods are known which allow statistics about the program behavior at run time. In this manner, a maximum parallelability can be initially assumed, for example. The individual paths report each pass back to a statistics unit (e.g. implemented in a CT (compare PACT10, 17); in principle, units according to PACT04 can also be used). It can now be analyzed by means of statistical measures which paths are actually passed in parallel. Furthermore, there is the possibility of using the data at run time for evaluating which paths are passed frequently, rarely or never in parallel.

[0160] Accordingly, it is possible to optimize at the next program call. It is known from PACT22, PACT24 that a number of configurations can either be configured at the same time and then driven by triggers (PACT08), or only a subset is configured and the remaining configurations are loaded later when required, the corresponding triggers being sent to a loading unit (CT, PACT10).

[0161] The value PAR(p) used in the text which follows specifies, for the purpose of illustration, how much ILP can be achieved at a certain stage (p) within the data flow graph transformed from the function (FIG. 7a).
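
For illustration, PAR(p) can be determined from a graph whose rows are given by the depth of each node; the following Python sketch uses an edge list whose shape merely imitates FIG. 7a (one node, then 2, 5, 3 and 1 per row; the concrete edges are assumptions of this sketch):

    # PAR(p) = number of nodes in row p; a row is executed within one clock unit.
    def rows_and_par(nodes, edges):
        depth = {n: 0 for n in nodes}
        for u, v in edges:                 # edge list assumed topologically ordered
            depth[v] = max(depth[v], depth[u] + 1)
        rows = {}
        for n, d in depth.items():
            rows.setdefault(d, []).append(n)
        return {p: len(ns) for p, ns in sorted(rows.items())}

    edges = [("i1","i2"),("i1","i3"),("i2","i4"),("i2","i5"),("i3","i6"),
             ("i3","i7"),("i3","i8"),("i4","i9"),("i6","i10"),("i7","i11"),
             ("i9","i12"),("i10","i12"),("i11","i12")]
    nodes = {n for e in edges for n in e}
    print(rows_and_par(nodes, edges))      # {0: 1, 1: 2, 2: 5, 3: 3, 4: 1}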

[0162] 6.0.2 Vector Parallelism

[0163] Vector parallelism is useful if relatively large amounts of data are to be processed. In this case, the linear sequences of operations can be vectorized, i.e. all operations can simultaneously process data, each separate operation typically processing a separate data word.

[0164] This procedure is in some cases not possible within loops. For this reason, analyses and optimizations are necessary. For example, the graph of a function can be expressed by a Petri net. Petri nets have the property that the forwarding of results from nodes is controlled, as a result of which, for example, loops can be modeled.

[0165] Feeding the result back in a loop determines the data throughput. Examples:

[0166] The result of the calculation n is needed for calculation n+1: only one calculation can be executed within the loop.

[0167] The result of the calculation n is needed for calculation n+m: m−1 calculations can be executed within the loop.

[0168] The result determines the termination of the loop but does not enter into the calculation of the results: no feedback is necessary. However, since false (too many) values may enter the loop, the output of the results can be interrupted immediately when the end condition has been reached at the loop end.

[0169] Before loops are analyzed, they can be optimized according to the prior art. For example, all possible instructions can be extracted from the loop and placed in front of or after the loop.

[0170] The value VEC used for illustration in the text which follows characterizes the degree of vectorizability of a function. In other words, VEC indicates how many data words can be processed simultaneously within a set of operations. VEC can be calculated, for example, from the number of arithmetic logic units needed for a function, n_(nodes), and the number of data words, n_(data), which can be calculated at the same time within the vector, e.g. by VEC=n_(data)/n_(nodes).

[0171] If a function can be mapped, for example, onto 5 arithmetic logic units (n_(nodes)=5) and data can be processed at the same time in each of the arithmetic logic units (n_(data)=5), then VEC=1 (FIG. 7b). If a function can be mapped, for example, onto 5 arithmetic logic units (n_(nodes)=5) but data can be processed in only one arithmetic logic unit at a time, e.g. due to a feedback of the results of the pipeline to the input (n_(data)=1), then VEC=1/5 (FIG. 7c).

[0172] VEC can be calculated for an entire function and/or for part-sections of a function.
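
Using the relationship read from the two examples above (VEC = n_(data)/n_(nodes)), the measure can be sketched as follows:

    # Illustrative only: degree of vectorizability of a (part-)function.
    def vec(n_nodes, n_data):
        # 1.0: full pipeline; < 1: feedback limits the data words in flight
        return n_data / n_nodes

    print(vec(n_nodes=5, n_data=5))        # 1.0 -> FIG. 7b, full pipeline
    print(vec(n_nodes=5, n_data=1))        # 0.2 -> FIG. 7c, feedback to the input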

[0173] 6.1 Evaluation of PAR and VEC

[0174] According to FIG. 7a, PAR(p) is determined for each row of a graph. A row of a graph is defined by the fact that it is executed within one clock unit. The number of operations depends on the implementation of the respective VPU.

[0175] If PAR(p) corresponds to the number of nodes in the row p, all nodes can be executed in parallel.

[0176] If PAR(p) is smaller, certain nodes are only executed in alternation. The alternative executions of in each case one node are combined in each case in one PAE. A selection device enables the alternative corresponding to the status of the data processing to be activated at run time, as described, for example, in PACT08.

[0177] VEC is also allocated to each row of a graph. If VEC=1 for one row, this means that the row remains in existence as a pipeline stage. If VEC is less than 1 for a row, all subsequent rows for which VEC is also less than 1 are combined, since pipelining is not possible. According to the order of operations, these are combined to form a sequence which is then configured in a PAE and is sequentially processed at run time. Corresponding methods are known, for example, from PACT02 and/or PACT04.

[0178] 6.1.1 Parallel Processor Models and Reentrant Code

[0179] Using the method described, parallel processor models of any complexity can be built up by grouping sequencers. In particular, sequencer structures for mapping reentrant code can be generated.

[0180] The synchronizations necessary in each case for this purpose can be performed, for example, by the time stamp method described in PACT18.

[0181] 6.2 Influence on Clocking

[0182] If a number of sequencers or sequential parts are mapped onto a PA, it is then useful to match the power of the individual sequencers to one another for reasons of power consumption. This can be done in such a manner that the operating frequencies of the sequencers are adapted to one another. For example, methods are known from PACT25 and PACT18 which allow individual clocking of individual PAEs or PAE groups.

[0183] The frequency of a sequencer is determined by means of the number of cycles which it typically needs for processing its assigned function.

[0184] If, for example, it needs 5 clock cycles for processing its function, its clocking should be 5 times higher than the clocking of the remaining system.

[0185] 6.3 Partitioning and Scheduling

[0186] Functions are partitioned in accordance with the aforementioned method. During partitioning, memories for data and relevant status are correspondingly inserted. Other alternative and/or more extensive methods are known from PACT13 and PACT18.

[0187] Some VPUs offer the possibility of differential reconfiguration according to PACT01, PACT10, PACT13, PACT17, PACT22, PACT24. This can be applied if only relatively few changes become necessary within the arrangement of PAEs during a reconfiguration. In other words, only the changes of a configuration compared with the current configuration are reconfigured. In this case, the partitioning can be such that the (differential) configuration following a configuration only contains the necessary reconfiguration data and does not represent a complete configuration.

[0188] The reconfiguration can be scheduled by means of the status which the function(s) report to a loading unit (CT), which selects and configures the next configuration or part-configuration on the basis of the incoming status. In detail, such methods are known from PACT01, PACT05, PACT10, PACT13, PACT17.

[0189] Furthermore, the scheduling can support the possibility of preloading configurations during the run time of another configuration. In this arrangement, a number of configurations can possibly also be preloaded speculatively, i.e. without ensuring that the configurations are needed at all. The configurations to be used are then selected at run time by means of selection mechanisms according to PACT08 (see also example NLS in PACT22/24).

[0190] The local sequences can also be controlled by the status of their data processing, as is known from PACT02, PACT04, PACT13. To carry out their reconfiguration, a further dependent or independent status can be reported to the CT (see, for example, PACT04, LLBACK).

[0191] 6.4 Description of the Figures

[0192] In the text which follows, the following symbols are used for simplifying the notation:

∨ or,

∧ and

[0193] FIG. 8a shows the mapping of the graph of FIG. 7a onto a group of PAEs with maximum achievable parallelism. All operations (instructions i1-i12) are mapped into individual PAEs.

[0194] FIG. 8b shows the same graph, for example with maximum usable vectorizability. However, the sets of operations V2=(i2, i3), V3=(i4, i5, i6, i7, i8), V4=(i9, i10, i11) are not parallel (PAR({2,3,4})=1). This allows resources to be saved by in each case allocating one set P2, P3, P4 of operations to one PAE. The operations to be executed in the respective PAE are selected by a status signal for each data word in each stage. The PAEs are networked as a pipeline (vector), and each PAE performs one operation per clock cycle over in each case different data words.

[0195] Sequence:

[0196] PAE1 calculates data and forwards them to PAE2. Together with the data, it forwards a status signal which indicates whether i2 or i3 is to be executed.

[0197] PAE2 further calculates the data of PAE1. The operation to be executed (i2 ∨ i3) is selected and calculated in accordance with the incoming status signal. In accordance with the calculation, PAE2 forwards a status signal to PAE3 which indicates whether (i4 ∨ i5) ∨ (i6 ∨ i7 ∨ i8) is to be executed.

[0198] PAE3 further calculates the data of PAE2. The operation to be executed (i4 ∨ i5) ∨ (i6 ∨ i7 ∨ i8) is selected and calculated in accordance with the incoming status signal. In accordance with the calculation, PAE3 forwards a status signal to PAE4 which indicates whether i9 ∨ i10 ∨ i11 is to be executed.

[0199] PAE4 further calculates the data of PAE3. The operation to be executed i9 ∨ i10 ∨ i11 is selected and calculated in accordance with the incoming status signal.

[0200] PAE5 further calculates the data of PAE4.

[0201] A possible corresponding method is described in PACT08 (FIGS. 5 and 6); PACT04 and PACT10, 13 also describe generally usable methods which, however, are more elaborate.

[0202] FIG. 8c again shows the same graph. In this example, vectorization is not possible, but PAR(p) is high, which means that in each case a multiplicity of operations can be executed simultaneously within one row. The operations which can be performed in parallel are P2={i2 ∧ i3}, P3={i4 ∧ i5 ∧ i6 ∧ i7 ∧ i8}, P4={i9 ∧ i10 ∧ i11}. The PAEs are networked in such a manner that they can arbitrarily exchange any data with one another. The individual PAEs only perform operations if there is an ILP in the corresponding cycle; otherwise they behave neutrally (NOP) and, if necessary, the clock and/or the power can be switched off in order to minimize the power dissipation.

[0203] Sequence:

[0204] In the first cycle, only PAE2 is operating and forwards the data to PAE2 and PAE3.

[0205] In the second cycle, PAE2 and PAE3 are operating in parallel and forward their data to PAE1, PAE2, PAE3, PAE4, PAE5.

[0206] In the third cycle, PAE1, PAE2, PAE3, PAE4, PAE5 are operating and forward the data to PAE2, PAE3, PAE5.

[0207] In the fourth cycle, PAE2, PAE3, PAE5 are operating and forward the data to PAE2.

[0208] In the fifth cycle only PAE2 is operating.

[0209] The function thus needs 5 cycles for the calculation. The corresponding sequencer should thus operate with 5 times the clock in relation to its environment in order to achieve a corresponding performance.

[0210] A possible corresponding method is described in PACT02 (FIGS. 19, 20 and 21); PACT04 and PACT10, 13 also describe generally usable methods which, however, are more elaborate.

[0211] FIG. 8d shows the graph of FIG. 7a for the case where there is no usable parallelism at all. To calculate a data word, each stage must be passed successively. Within the stages, only exactly one of the branches is processed in each case.

[0212] The function also needs 5 cycles for the calculation: cy1=(i1), cy2=(i2 ∨ i3), cy3=(i4 ∨ i5 ∨ i6 ∨ i7 ∨ i8), cy4=(i9 ∨ i10 ∨ i11), cy5=(i12). The corresponding sequencer should thus operate at 5 times the clock in relation to its environment in order to achieve a corresponding performance.

[0213] Such a function can be mapped, for example, similarly to FIG. 8c, by means of a simple sequencer according to PACT02 (FIGS. 19, 20 and 21). PACT04 and PACT10, 13 also describe generally usable methods which, however, are more elaborate.

[0214] The mappings shown in FIG. 8 can be mixed and grouped as required.

[0215] In FIG. 9a, for example, the same function is shown, in which the paths (i2 ∧ (i4 ∨ i5) ∧ i9) and (i3 ∧ (i6 ∨ i7 ∨ i8) ∧ (i10 ∨ i11)) can be executed in parallel. (i4 ∨ i5), (i6 ∨ i7 ∨ i8) and (i10 ∨ i11) are in each case alternating. The function can also be vectorized. It thus makes it possible to build up a pipeline in which 3 PAEs (PAE4, PAE5, PAE7) each determine the function to be executed by them by means of status signals.

[0216] FIG. 9b shows a similar example in which no vectorization is possible. However, the paths (i1 ∧ i2 ∧ (i4 ∨ i5) ∧ i9 ∧ i12) and (i3 ∧ (i6 ∨ i7 ∨ i8) ∧ (i10 ∨ i11)) are in parallel.

[0217] This makes it possible to achieve the optimum performance by using two PAEs which also process the parallel paths in parallel. The PAEs are synchronized to one another by means of status signals which are preferably generated by PAE1, since it calculates the beginning (i1) and the end (i12) of the function.

[0218] It should be pointed out particularly that a multiple arrangement of sequencers can result in a symmetric parallel processor model (SMP) or similar multiprocessor models currently used.

[0219] Furthermore, it should be pointed out that all configuration registers for the scheduling can also be loaded with new configurations in the background and during the data processing. In detail:

[0220] Method according to PACT02:

[0221] Independent storage areas or registers are available which can be executed independently. Certain places are jumped to by incoming triggers, and jumping is also possible by means of jump instructions (JMP, CALL/RET) which may also be conditionally executable.

[0222] Method According to PACT04:

[0223] Write and read pointers are independently available, as a result of which, in principle, an independence and thus the possibility of access in the background are given. In particular, it is possible to segment the memories, as a result of which additional independence is given. Jumping is possible by means of jump instructions (JMP, CALL/RET) which may also be conditionally executable.

[0224] Method According to PACT08:

[0225] The individual registers which can be selected by the triggers are basically independent and therefore allow an independent configuration, particularly in the background. Jumps within the registers are not possible, and selection takes place exclusively via the trigger vectors.

1. Method for translating high-level languages to reconfigurable architectures, characterized in that a finite automaton for calculation is built up in such a manner that a complex combinational network of the individual functions is formed and memories are allocated to the network for storing the operands and results.